Add data split mode to DMatrix MetaInfo #8568

rongou · 2022-12-07T17:51:07Z

We need this information to change the behavior of the predictor when data is split column-wise.

Part of #8424

rongou · 2022-12-07T17:51:19Z

trivialfis

Do we still need the one in learner if DMatrix contains all the information?

rongou · 2022-12-07T19:05:30Z

Hmm, if we don't declare a training parameter, how would the user specify it? I imagine if someone wants to do column-wise split, they'd do something like:

param = {'tree_method': 'gpu_hist', 'data_split_mode': 'col'}
xgb.train(param, ...)

trivialfis · 2022-12-07T20:52:44Z

If the DMatrix contains the information, the user doesn't need to specify that parameter, right?

rongou · 2022-12-07T21:08:56Z

Yeah I guess the question is whether to specify it through dmatrix or as a training parameter.

dtrain = xgb.DMatrix('very-wide.txt.train', data_split_mode='col')
dtest = xgb.DMatrix('very-wide.txt.test', data_split_mode='col')
param = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 500
xgb.train(param, dtrain, num_round, evals=watchlist)

vs.

dtrain = xgb.DMatrix('very-wide.txt.train')
dtest = xgb.DMatrix('very-wide.txt.test')
param = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist', 'data_split_mode': 'col'}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 500
xgb.train(param, dtrain, num_round, evals=watchlist)

Which one do you think is more natural or user-friendly?

rongou · 2022-12-08T21:36:30Z

@trivialfis this is what it looks like if we take the dsplit parameter out of training and put it in data loading. Let me know what you think. Still need to fix some downstream calls.

trivialfis · 2022-12-09T18:46:26Z

I think the DMatrix one is better.

A little bit of background here: @rongou shared some potential changes with me offline for federated learning. One important factor is how we handle the data split parameter in the predictor. The original plan was to pass it as part of the learner train param or learner model param. We later thought it might be possible to make it part of the DMatrix, so that all algorithms dispatch based on the DMatrix they receive, which from my perspective is cleaner than the learner option.
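To make the "dispatch based on the DMatrix" idea concrete, here is a minimal sketch in plain Python. It does not use XGBoost at all: `ToyDMatrix`, `ToyMetaInfo`, `DataSplitMode`, and `predict` are hypothetical stand-ins invented for illustration, and the two code paths only return a description of what a real implementation might do.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class DataSplitMode(Enum):
    # Hypothetical enum; the names mirror the 'row'/'col' values used in this thread.
    ROW = "row"
    COL = "col"


@dataclass
class ToyMetaInfo:
    # Hypothetical stand-in for a split-mode field stored alongside the data.
    data_split_mode: DataSplitMode = DataSplitMode.ROW


@dataclass
class ToyDMatrix:
    rows: List[List[float]]
    info: ToyMetaInfo = field(default_factory=ToyMetaInfo)


def predict(dmat: ToyDMatrix) -> str:
    # The algorithm reads the mode from the data it receives,
    # not from a learner or training parameter.
    if dmat.info.data_split_mode is DataSplitMode.COL:
        return "column-split path: each worker holds a subset of the features"
    return "row-split path: each worker holds complete rows"


print(predict(ToyDMatrix([[1.0, 2.0]])))
print(predict(ToyDMatrix([[1.0]], ToyMetaInfo(DataSplitMode.COL))))
```

With this shape, the caller never passes a split mode to `predict`; whoever builds the data container decides it once, which is the property argued for above.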

cc @RAMitchell .

rongou · 2022-12-13T01:50:49Z

@trivialfis Turns out this actually solves a problem we talked about before. If the user has done some preprocessing to split the training data beforehand, now they can pass in data_split_mode='none' to avoid further splitting.

Other than that bit of additional functionality, this PR preserves the existing behavior. Do you think we should merge it? The CI should pass now.

trivialfis

I agree on using DMatrix as the parameter holder. Some other questions in comments.

include/xgboost/c_api.h

rongou · 2022-12-20T00:07:40Z

@trivialfis @hcho3 I think this is close to what we want. Please take another look. Thanks!

trivialfis

The implementation looks good to me. Excellent work! Some questions regarding the interface and the scope of the new parameter.

trivialfis · 2022-12-20T10:02:45Z

include/xgboost/c_api.h

@@ -126,12 +126,29 @@ XGB_DLL int XGBGetGlobalConfig(char const **out_config);

 /*!
 * \brief load a data matrix
+ * \deprecated since 1.7.3


I think the next big one would be 2.0 unless something major comes up.

trivialfis · 2022-12-20T10:05:50Z

include/xgboost/c_api.h

+ * \param out a loaded data matrix
+ * \return 0 when success, -1 when failure happens
+ */
+XGB_DLL int XGDMatrixCreateFromFileV2(char const *config, DMatrixHandle *out);


Have you considered making it a proper API for URI? For instance, FromURI? Users actually need to use some of the URI formats like file.txt?format=csv. XGBoost/dmlc-core doesn't guess the format, which is a source of bugs when users pass in only the file name.

Also, do you plan to introduce the need_split with other input sources as well? If not, we can make it a URI parameter and limit its use to this function. Otherwise, we need to have an additional parameter in all language bindings.

Changed to URI.

For need_split, we can get rid of it if we are just going to preserve the current behavior, which is to split for distributed training and not otherwise. If a user wants more flexibility, they can always use another API to load the data. For distributed training, most people are probably not using the file API anyway.

Removed it as a parameter.
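As a rough illustration of the URI point raised in this review thread, the sketch below pulls the `format` query parameter out of a URI such as `file.txt?format=csv` and packs the URI into a JSON config string. It uses only the Python standard library. The helper names and the JSON keys `uri` and `data_split_mode` are assumptions made for this example; the thread does not spell out the final config schema, and this is not how dmlc-core parses URIs.

```python
import json
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def declared_format(uri: str) -> Optional[str]:
    """Return the value of the `format` query parameter, or None if absent."""
    values = parse_qs(urlsplit(uri).query).get("format")
    return values[0] if values else None


def build_load_config(uri: str, data_split_mode: str = "row") -> str:
    # Hypothetical caller-side guard: refuse a bare file name so the
    # format is always stated explicitly instead of being guessed.
    if declared_format(uri) is None:
        raise ValueError(
            f"no format in URI {uri!r}; expected e.g. 'file.txt?format=csv'"
        )
    # Key names are assumptions for illustration only.
    return json.dumps({"uri": uri, "data_split_mode": data_split_mode})


print(declared_format("file.txt?format=csv"))
print(build_load_config("very-wide.txt.train?format=libsvm", "col"))
```

Keeping the format inside the URI means a single string argument carries everything the loader needs, so no extra parameter has to be threaded through each language binding.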

rongou · 2022-12-21T19:07:43Z

@trivialfis finally got the CI to pass. PTAL

trivialfis

Thank you for the work on preparing the DMatrix for column split!

Add data split mode to DMatrix MetaInfo

40252ee

trivialfis reviewed Dec 7, 2022


rongou added 4 commits December 8, 2022 11:39

Merge remote-tracking branch 'upstream/master' into data-split-param

7c35c40

remove dsplit training param

26ed1a9

fix dmatrix validation

d3fda24

fix python

8e797f7

rongou added 8 commits December 12, 2022 10:34

Merge remote-tracking branch 'upstream/master' into data-split-param

e12f361

fix dsplit for local mode

8f7ac3e

fix java bulid

fa7a670

fix R package

afc5fa0

fix demo

31b7112

fix line too long

32d7fcc

fix r doc

c857cd9

update roxgen

aa0c26c

Merge remote-tracking branch 'upstream/master' into data-split-param

cbd1a42

trivialfis reviewed Dec 13, 2022


include/xgboost/c_api.h (outdated)

rongou added 7 commits December 14, 2022 17:49

Merge remote-tracking branch 'upstream/master' into data-split-param

d7830cb

Merge remote-tracking branch 'upstream/master' into data-split-param

c9ee1d6

add XGDMatrixCreateFromFileV2

6782dd9

add a test for v2

86226e0

Merge remote-tracking branch 'upstream/master' into data-split-param

914df2a

add need_split to json config

bde1e4c

Merge remote-tracking branch 'upstream/master' into data-split-param

55f8aa4

trivialfis reviewed Dec 20, 2022


rongou added 6 commits December 20, 2022 10:47

Merge remote-tracking branch 'upstream/master' into data-split-param

9002705

change to uri

c80a3ae

remove need_split as a parameter

58ae574

fix python

f6148a3

fix dask test

da7d545

Merge remote-tracking branch 'upstream/master' into data-split-param

417dc18

rongou mentioned this pull request Dec 21, 2022

Vertical Federated Learning RFC #8424

Open

trivialfis approved these changes Dec 24, 2022


trivialfis merged commit 3ceeb8c into dmlc:master Dec 25, 2022

rongou deleted the data-split-param branch September 25, 2023 16:42

ShellLM mentioned this pull request Aug 11, 2024

Xgboost 2.0.0 · dmlc/xgboost irthomasthomas/undecidability#878

Open

1 task
